June 6, 2017

Lesson context

Lesson context

DSCI 561 Regression I

(taken from public description of UBC MDS course DSCI 561)

Short Description

Linear models for a quantitative response variable, with multiple categorical and/or quantitative predictors. Matrix formulation of linear regression. Model assessment and prediction.

DSCI 561 Regression I

(taken from public description of UBC MDS course DSCI 561)

Course Learning Outcomes

By the end of the course, students are expected to be able to:

  • Fit and interpret a linear regression model.
  • Identify whether a linear regression model is appropriate for a given dataset.
  • Critique a specific regression model applied to a given dataset on the basis of both diagnostic plots and hypothesis tests.
  • Specify and interpret interaction terms and nonlinear terms.

DSCI 561 Regression I

(adapted from public description of UBC MDS course DSCI 561)

Reference Materials

  • Faraway, Julian J. Linear Models with R, 2nd Edition. Chapman and Hall, 2014.
  • broom package in R

Sample lesson: Fitting and interpreting linear regression models in R

1/4 of the way into the course. Before this lesson, students will have learned the following:

  • model notation in R
  • one-way ANOVA
  • linear regression via ordinary least squares

Sample lesson: Fitting and interpreting linear regression models in R

20 minute lesson learning objectives:

By the end of this lesson students are expected to be able to:

  • fit a simple linear model in R
  • interpret the output of the linear model object in R
  • use three functions from the broom package extract linear model object output

Lesson Motivation

The Data:

from Data USA: https://datausa.io/

display_name income mean_commute_minutes median_property_value non_eng_speakers_pct owner_occupied_housing_units pop
Texas 53207 24.5090 136000 0.3503480 0.622325 26538614
Pennsylvania 53599 25.2695 166000 0.1064820 0.692052 12779559
South Carolina 45483 23.0552 139900 0.0686766 0.685914 4777576
New Hampshire 66779 25.2973 237300 0.0786821 0.709609 1324201
Kansas 52205 18.2779 132000 0.1130530 0.666891 2892987
Hawaii 69515 25.5813 515300 0.2521790 0.569030 1406299

Simple linear regression:

State property value as a function of income

\(Y = \textrm{State}\: \textrm{median}\: \textrm{property}\: \textrm{value}\)
\(X_{1} = \textrm{income}\)

Model notation for linear regression in R

Simple linear regression:

mathematical forumula:
\(Y = \beta_{0} + \beta_{1}X_{1} + \varepsilon\)

model forumula in R:
Y ~ X1

More complex linear regression:

mathematical forumula:

\(Y = \beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2} + \beta_{3}X_{3} + \varepsilon\)

\(Y = \beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2} + \beta_{3}X_{1}X_{2} + \varepsilon\)

model forumula in R:

Y ~ X1 + X2 + X3

Y ~ X1 * X2

Cheatsheet for model notation in R:

http://faculty.chicagobooth.edu/richard.hahn/teaching/formulanotation.pdf

Syntax for linear regression in R

\(\textrm{State}\: \textrm{median}\: \textrm{property}\: \textrm{value} = \beta_{0} + \beta_{1}\textrm{income} + \varepsilon\)

prop_model <- lm(median_property_value ~ income, data = prop_data)

Check-in question:

Syntax for simple linear regression

create linear model object:

prop_model <- lm(median_property_value ~ income, data = prop_data)

print/view model object output:

summary(prop_model)
## 
## Call:
## lm(formula = median_property_value ~ income, data = prop_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -65249 -36542  -6990   8003 219312 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.503e+05  4.393e+04  -3.422  0.00125 ** 
## income       6.420e+00  7.999e-01   8.026 1.52e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 58860 on 50 degrees of freedom
## Multiple R-squared:  0.563,  Adjusted R-squared:  0.5542 
## F-statistic: 64.41 on 1 and 50 DF,  p-value: 1.517e-10

Decoding R’s output from summary():

Extracting output from model object in base R:

model output code
parameter/coefficient estimates model_object$coefficients
residuals model_object$residuals
predicted/fitted values model_object$fitted.values
p-values for coefficients summary(model_object)$coefficients[,4]
\(\sigma\) estimate summary(model_object)$sigma
\(R^{2}\) summary(model_object)$r.squared

Working with model output in base R,

the good, the bad and the ugly…

The good…

  • all the information you want it there
  • this has been used for many many many years and thus should be familiar to most Data
  • Scientists and Statisticians

The bad…

  • inconsistent syntax to extract model output
  • model output is returned in a variety of forms, and is not tidy data

The ugly…

  • bizarre symbols in some column names (e.g., summary(model_object)$coefficient)
  • F-statistic p-value is never stored in memory and thus must be calculated by hand

broom:

A better way for working with model output in R

broom for working with model output in R:

The good…

  • all the information you want it there, and easy to access
  • consistent syntax
  • no weird column names

The bad…

  • it’s new, so not everyone is familiar with it

The ugly…

  • ???
broom function model output
tidy(model_object) | model parameters (e.g., coefficients and p-values) | |augment(model_object)| model data (e.g., residuals and predicted values) | |glance(model_object)` model quality, complexity and summaries (e.g., \(\sigma\) estimate and \(R^{2}\))

Group challenge question:

What did we learn today?

Where do we go from here?

  • fitting in interpreting more complex linear models in R, revisit:
    • reference-treatment parameterization in R
    • R’s matrix representation of X (model.matrix)
  • model diagnostics (e.g., how do we know if we have a good model? and what to do if we do not…)